This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Pivoting

Overview

When manipulating data, pivoting can be understood to refer to one of two processes: 1. the creation of pivot tables, which are tables “… of statistics that summarize the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way… They arrange and rearrange (or”pivot“) statistics in order to draw attention to useful information. This leads to finding figures and facts quickly making them integral to data analysis.” https://en.wikipedia.org/wiki/Pivot_table#.

  1. The conversion of a table from long to wide format, or vice versa.

The former is a crucial step in data analysis, and is covered elsewhere (link to dplyr, summarise, etc). In this section, we will focus on the latter definition.

https://datacarpentry.org/r-socialsci/03-dplyr-tidyr/index.html

Wide-to-long

Transforming a dataset from wide to long

Data

Data are often entered and stored in a format that might be useful for presentation, but not for analysis. Let us take the count_data dataset as an example, which is stored in a “wide” format, which means that each column is a variable and each row an observation. This is useful for presenting the information in a table or for entering data (e.g. in Excel) from case report forms. However, these typically needs to be transformed to “long” format in order to analyse and visualise.

DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Each observation in this dataset refers to the malaria counts at one of 65 facilities on a given date, ranging from 2019-03-18 to 2019-06-14. These facilties are located in one Province (North) and four Districts (Spring, Bolo, Dingo, and Barnard). The dataset provides the overall counts of malaria, as well as age-specific counts in each of three age groups - <4 years, 5-14 years, and 15 years and older.

Visualising the overall malaria counts over time poses no difficulty with the data in it’s current format:

ggplot(count_data) +
  geom_col(aes(x = data_date, y = malaria_tot))

However, what if we wanted to display the relative contributions of each age group to this total count? In this case, we need to ensure that the variable of interest (age group), appears in the dataset in a single column that can be passed to {ggplot2}’s “aesthetics” (aes()) function.


Consider also using the common problem whereby data are stored with dates as the columns, as in tidyr::table4a

tidyr::table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

pivot_longer()

First, let’s begin by loading our packages and converting count_data to a tibble for easy printing:

pacman::p_load(tidyverse)

# Convert count_data to `tibble` for better printing
count_data <- 
  count_data %>% 
  as_tibble() 

count_data
## # A tibble: 3,038 x 10
##    location_name data_date  submitted_date Province District `malaria_rdt_0-…
##    <chr>         <date>     <date>         <chr>    <chr>               <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring                 11
##  2 Facility 2    2019-06-13 2019-06-14     North    Bolo                   11
##  3 Facility 3    2019-06-13 2019-06-14     North    Dingo                   8
##  4 Facility 4    2019-06-13 2019-06-14     North    Bolo                   16
##  5 Facility 5    2019-06-13 2019-06-14     North    Bolo                    9
##  6 Facility 6    2019-06-13 2019-06-14     North    Dingo                   3
##  7 Facility 6    2019-06-12 2019-06-14     North    Dingo                   4
##  8 Facility 5    2019-06-12 2019-06-14     North    Bolo                   15
##  9 Facility 5    2019-06-11 2019-06-14     North    Bolo                   11
## 10 Facility 5    2019-06-10 2019-06-14     North    Bolo                   19
## # … with 3,028 more rows, and 4 more variables: `malaria_rdt_5-14` <int>,
## #   malaria_rdt_15 <int>, malaria_tot <int>, newid <int>

Next, we want to use {tidyr}’s pivot_longer() function to convert the wide dataset to a long format, converting the four columns with data on malaria counts to two new columns: one which captures the variable name and one which captures the values from the cells. Since these four variables all begin with the prefix malaria_, we can make use of the handy function starts_with().

df_long <- 
  count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_")
  )

df_long
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name    value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>   <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…     5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     5
## # … with 12,142 more rows

However, we could also have specified the columns by position:

count_data %>% 
  pivot_longer(
    cols = 6:9
  )
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name    value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>   <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…     5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     5
## # … with 12,142 more rows

or by named range:

count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_tot
  )
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name    value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>   <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malari…    46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…     5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malari…    26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malari…     5
## # … with 12,142 more rows

Notice that the newly created dataframe (df_long) has more rows (12,152 vs 3,038); it has become longer. In fact, it is precisely four times as long, because each row in the original dataset now represents four rows in df_long, one for each of the malaria count observations (<4y, 5-14y, 15y+, and total).

In addition to becoming longer, the new dataset has fewer columns (8 vs 10), as the data previously stored in four columns (those beginning with the prefix malaria_) is now stored in two. These two columns are given the default names of name and value, but we can override these defaults to provide more meaningful names, which can help remember what is stored within, using the names_to and values_to arguments. Let’s use the names age_group and count:

df_long <- 
  count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_"),
    names_to = "age_group",
    values_to = "counts"
  )

df_long
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid age_group
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>    
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_…
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_…
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_…
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_…
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_…
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_…
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_…
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_…
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_…
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_…
## # … with 12,142 more rows, and 1 more variable: counts <int>

We can now pass this new dataset to {ggplot2} to display the malaria counts by age group:

ggplot(df_long) +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )

Have a look at the plot - what is wrong here? We have encountered a common problem - we have also included the total counts from the malaria_tot column, so the magnitude of each bar in the plot is twice as high as it should be.

We can handle this in a number of ways. We could simply filter it from the dataset we pass to {ggplot2}:

df_long %>% 
  filter(age_group != "malaria_tot") %>% 
  ggplot() +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )

Alternatively, we could have excluded this variable when we ran pivot_longer, thereby maintaining it in the dataset as a separate variable:

count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_rdt_15,
    names_to = "age_group",
    values_to = "counts"
  ) %>% 
  ggplot() +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )

Long-to-wide

Transforming a dataset from long to wide

In some instances, we may wish to convert a dataset to a wider format. For this, we can use the pivot_wider() function.

A typical use case is when we want to transform the results of an analysis into a format which is more digestible for the reader. Typically, we are transforming a dataset in which the observations are spread over multiple rows to one in which each observation occupies a single row.

This introduces the useful topic of “tidy data”, in which each variable has it’s own column, each observation has it’s own row, and each value has it’s own cell. More about this topic can be found here https://r4ds.had.co.nz/tidy-data.html.

Data

Let us use the linelist dataset. Suppose that we want to know the counts of individuals in the different age groups, by sex:

linelist <- 
  linelist %>% 
  as_tibble()
  
df_wide <- 
  linelist %>% 
  count(age_cat, gender)

This gives us a long dataset that is great for visualisation, but not ideal for presentation in a table:

ggplot(df_wide) +
  geom_col(aes(x = age_cat, y = n, fill = gender))

Pivot wider

Therefore, we can use pivot_wider() to put this into a better format for inclusion as tables in our reports. The argument names_from specifies the column from which to generate the new column names, while the argument values_from specifies the column from which to take the values to populate the cells:

table_wide <- 
  df_wide %>% 
  pivot_wider(
    names_from = gender,
    values_from = n
  )

table_wide
## # A tibble: 8 x 4
##   age_cat     f     m  `NA`
##   <fct>   <int> <int> <int>
## 1 0-4       612   410    32
## 2 5-9       648   466    39
## 3 10-14     490   423    44
## 4 15-19     422   435    32
## 5 20-29     425   606    34
## 6 30-49     168   478    12
## 7 50-69       1    23    NA
## 8 <NA>       NA    NA    88

This table is much nicer for inclusion in our reports:

table_wide %>% 
  janitor::adorn_totals(c("row", "col")) %>% # adds a total row and column
  knitr::kable() %>% 
  kableExtra::row_spec(row = 9, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
age_cat f m NA Total
0-4 612 410 32 1054
5-9 648 466 39 1153
10-14 490 423 44 957
15-19 422 435 32 889
20-29 425 606 34 1065
30-49 168 478 12 658
50-69 1 23 NA 24
NA NA NA 88 88
Total 2766 2841 281 5888

Fill

Filling in missing data

Data

In some situations after a pivot, and more commonly after a bind, we are left with gaps in some cells that we would like to fill. For example, take two datasets, each with observations for the measurement number, the name of the facility, and the case count at that time. However, the second dataset also has a variable Year. When we perform a bind_rows() to join the two datasets together, the Year variable is filled with NA for those rows where there was no prior information (i.e. the first dataset):

df1 <- 
  tibble::tribble(
       ~Measurement, ~Facility, ~Cases,
                  1,  "Hosp 1",     66,
                  2,  "Hosp 1",     26,
                  3,  "Hosp 1",      8,
                  1,  "Hosp 2",     71,
                  2,  "Hosp 2",     62,
                  3,  "Hosp 2",     70,
                  1,  "Hosp 3",     47,
                  2,  "Hosp 3",     70,
                  3,  "Hosp 3",     38,
       )

df1 
## # A tibble: 9 x 3
##   Measurement Facility Cases
##         <dbl> <chr>    <dbl>
## 1           1 Hosp 1      66
## 2           2 Hosp 1      26
## 3           3 Hosp 1       8
## 4           1 Hosp 2      71
## 5           2 Hosp 2      62
## 6           3 Hosp 2      70
## 7           1 Hosp 3      47
## 8           2 Hosp 3      70
## 9           3 Hosp 3      38

df2 <- 
  tibble::tribble(
    ~Year, ~Measurement, ~Facility, ~Cases,
     2000,            1,  "Hosp 4",     82,
     2001,            2,  "Hosp 4",     87,
     2002,            3,  "Hosp 4",     46
  )

df2
## # A tibble: 3 x 4
##    Year Measurement Facility Cases
##   <dbl>       <dbl> <chr>    <dbl>
## 1  2000           1 Hosp 4      82
## 2  2001           2 Hosp 4      87
## 3  2002           3 Hosp 4      46

df_combined <- 
  bind_rows(df1, df2) %>% 
  arrange(Measurement, Facility)

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66    NA
##  2           1 Hosp 2      71    NA
##  3           1 Hosp 3      47    NA
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26    NA
##  6           2 Hosp 2      62    NA
##  7           2 Hosp 3      70    NA
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8    NA
## 10           3 Hosp 2      70    NA
## 11           3 Hosp 3      38    NA
## 12           3 Hosp 4      46  2002

fill()

In this case, Year is a useful variable to include, particularly if we want to explore trends over time. Therefore, we use fill() to fill in those empty cells, by specifying the column to fill and the direction (in this case up):

df_combined %>% 
  fill(Year, .direction = "up")
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66  2000
##  2           1 Hosp 2      71  2000
##  3           1 Hosp 3      47  2000
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26  2001
##  6           2 Hosp 2      62  2001
##  7           2 Hosp 3      70  2001
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8  2002
## 10           3 Hosp 2      70  2002
## 11           3 Hosp 3      38  2002
## 12           3 Hosp 4      46  2002

We can rearrange the data so that we would need to fill in a downward direction:

df_combined <- 
  df_combined %>% 
  arrange(Measurement, desc(Facility))

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47    NA
##  3           1 Hosp 2      71    NA
##  4           1 Hosp 1      66    NA
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70    NA
##  7           2 Hosp 2      62    NA
##  8           2 Hosp 1      26    NA
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38    NA
## 11           3 Hosp 2      70    NA
## 12           3 Hosp 1       8    NA

df_combined <- 
  df_combined %>% 
  fill(Year, .direction = "down")

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47  2000
##  3           1 Hosp 2      71  2000
##  4           1 Hosp 1      66  2000
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70  2001
##  7           2 Hosp 2      62  2001
##  8           2 Hosp 1      26  2001
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38  2002
## 11           3 Hosp 2      70  2002
## 12           3 Hosp 1       8  2002

This dataset is now useful for plotting:

ggplot(df_combined) +
  aes(Year, Cases, fill = Facility) +
  geom_col()

But less useful for presenting in a table, so let’s practice converting this long, untidy dataframe into a wider, tidy dataframe:

df_combined %>% 
  pivot_wider(
    id_cols = c(Facility, Year, Cases),
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  arrange(Facility) %>% 
  janitor::adorn_totals(c("row", "col")) %>% 
  knitr::kable() %>% 
  kableExtra::row_spec(row = 5, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
Facility 2000 2001 2002 Total
Hosp 1 66 26 8 100
Hosp 2 71 62 70 203
Hosp 3 47 70 38 155
Hosp 4 82 87 46 215
Total 266 245 162 673

N.B. In this case, we had to specify to only include the three variables Facility, Year, and Cases as the additional variable Measurement would interfere with the creation of the table:

df_combined %>% 
  pivot_wider(
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  knitr::kable()
Measurement Facility 2000 2001 2002
1 Hosp 4 82 NA NA
1 Hosp 3 47 NA NA
1 Hosp 2 71 NA NA
1 Hosp 1 66 NA NA
2 Hosp 4 NA 87 NA
2 Hosp 3 NA 70 NA
2 Hosp 2 NA 62 NA
2 Hosp 1 NA 26 NA
3 Hosp 4 NA NA 46
3 Hosp 3 NA NA 38
3 Hosp 2 NA NA 70
3 Hosp 1 NA NA 8